Abstract

Deep directed generative models have attracted much attention recently due to
their expressive representation power and the ability of ancestral sampling. One
major difficulty of learning directed models with many latent variables is the
intractable inference. To address this problem, most existing algorithms make
assumptions to render the latent variables independent of each other, either by
designing specific priors, or by approximating the true posterior using a
factorized distribution. We believe the correlations among latent variables are
crucial for faithful data representation. Driven by this idea, we propose an
inference method based on the conditional pseudo-likelihood that preserves the
dependencies among the latent variables. For learning, we propose to employ the
hard Expectation Maximization (EM) algorithm, which avoids the intractability of
the traditional EM by using max-out instead of sum-out to compute the data
likelihood. Qualitative and quantitative evaluations of our model against
state-of-the-art deep models on benchmark datasets demonstrate the effectiveness
of the proposed algorithm in data representation and reconstruction.


1 Introduction 

Deep directed generative models have received increasing attention recently, because
the top-down connections explicitly model the data generating process. Different levels
of latent variables capture features (or abstractions [15]) in a coarse-to-fine manner.
Compared with undirected models such as restricted Boltzmann machines (RBMs) and
deep Boltzmann machines (DBMs) [?], directed generative models have their own
advantages. First, samples can be easily obtained by straightforward ancestral sampling
without the need for Markov chain Monte Carlo (MCMC) methods. Second, there is
no partition function issue since the joint distribution is obtained by multiplying all
local conditional probabilities, which requires no further normalization. Last but most
importantly, the latent variables are dependent on each other given the observations
through the so-called "explain-away" principle. Through their interdependency, latent
variables coordinate with each other to better explain the patterns in the visible layer.

Learning directed models with many latent variables is challenging, mainly due
to the intractable computation of the posterior probability. Although Markov chain
Monte Carlo (MCMC) is a straightforward solution, the mixing stage is often
too slow. To simplify the inference for deep belief networks, Hinton et al. [8]
introduced a complementary prior for the latent variables which makes the posterior
fully factorized. Some recent efforts for learning generative models have focused on
variational methods [?, ?, ?], which introduce another distribution to approximate the
true posterior and maximize a variational lower bound of the data likelihood. The
approximating distribution is typically fully factorized for computational efficiency.
However, the assumption of a factorized distribution sacrifices the "explain-away"
effect for efficient inference, which inevitably enlarges the distance to the true
posterior and weakens the representation power of the model. This defeats a major
advantage of directed graphical models.

In this work, we address the problem of learning deep directed models from a
different direction. We propose to use the EM algorithm with two approximations in the
inference and learning phases. First, we approximate the true posterior distribution
during inference by the conditional pseudo-likelihood, which preserves to a certain
degree the dependencies among latent variables. Second, we approximate the data
likelihood using a max-out setting during the E-step of the learning to overcome the
exponential number of configurations of the latent variables. As a result, the E-step
requires maximum a posteriori (MAP) inference, which is efficiently solved based on the
pseudo-likelihood. It can also be seen as the application of iterated conditional modes
(ICM) [?] to directed graphical models. In the M-step, the problem is transformed into
parameter learning with complete data, which is much easier to handle.

2 Related Work 

The research on learning directed models with latent variables dates back to the
1990s. A standard approach is the Expectation Maximization (EM) algorithm, which
maximizes the expected data log-likelihood for parameter learning. The EM algorithm and
its variants have been used for learning latent mixtures of factor analyzers [?],
probabilistic latent semantic indexing [?], probabilistic latent semantic analysis [?] and
latent Dirichlet allocation (LDA) [?]. Such models have only a few latent variables, so
the posterior probability of the latent variables can be exactly computed in the E-step.

In the case of many latent variables, exact computation of the posterior is intractable
because of the exponential number of latent variable configurations. Patel et al. [?]
connect each latent variable to a small patch of the input data. Each patch and the
corresponding latent variable therefore form a small model, which allows exact
maximum a posteriori (MAP) inference. Many other approaches have been proposed
to approximate the posterior probability of latent variables along two directions.

One approach is to replace the true posterior distribution with a factorized
distribution as an approximation. This approach was first proposed by Saul et al. [20],
known as the mean field theory for learning sigmoid belief networks. A fully
factorized variational posterior is introduced to approximate the true posterior
distribution of latent variables. Recently, Gan et al. [5] extended the mean field method,
and proposed a Bayesian approach to learn deep sigmoid belief networks by introducing
sparsity-encouraging priors on the model parameters. Alternatively, the posterior
distribution can be approximated using a feed-forward network. The wake-sleep algorithm
[9] augments the multi-layer belief networks with feed-forward recognition networks.
Wake-sleep alternates between updating the model parameters in the wake phase and
the recognition network parameters in the sleep phase. Inspired by this idea, many
approaches have been proposed recently for learning directed graphical models by
maximizing a variational lower bound on the data log-probability. Mnih and Gregor [13]
introduced the neural variational inference and learning (NVIL) algorithm for sigmoid
belief networks. A feed-forward inference network is used to obtain exact samples
from the variational posterior, and a neural network is introduced to reduce the variance
of the samples. Kingma and Welling [?] proposed the auto-encoding variational Bayes
method for continuous latent variables, in which a reparameterization is employed to
efficiently generate samples from the Gaussian distribution. Similarly, Rezende et al.
[16] proposed a stochastic backpropagation algorithm for learning deep generative
models with continuous latent variables. Gregor et al. [?] augmented the directed model
with an encoder, which is also a kind of inference network.

Another direction is to make the posterior probability factorized by specifically
designing a prior distribution of latent variables. Hinton et al. [8] proposed a
complementary prior to ensure a factorized posterior, and proposed a fast learning
algorithm for deep belief networks (DBNs), which are basically hybrid networks with a
single undirected layer and several directed layers.

In all the above-mentioned methods, the inference typically assumes independence
among latent variables, due to a special prior or a factorized approximation, in order to
accelerate the inference. Because of this assumption, the inference network is not
able to capture the correlations among the latent variables. Therefore the approximate
posterior might differ significantly from the underlying true posterior. In this work,
we intend to preserve the latent variable dependencies for better data representation.
We approximate the posterior probability by the conditional pseudo-likelihood, and
employ a hard version of the EM algorithm for parameter estimation.

3 Latent Regression Bayesian Network 

We propose a generalized directed graphical model, called latent regression Bayesian
network (LRBN), as shown in Fig. 1(a). The latent variables in LRBN are binary, and
the visible variables can be continuous or discrete. Each latent variable is connected to
all visible variables. We discuss the parameterization of both cases in the sequel. The
case of continuous latent variables can be referred to as factor analyzers [?] or deep
latent Gaussian models [?].

3.1 Discrete LRBN


For discrete LRBN, both latent and observation variables are discrete. For brevity,
we discuss the binary case with observation variables $x \in \{0,1\}^{n_d}$ and latent
variables $h \in \{0,1\}^{n_h}$. We assume that the latent variables $h$ determine the
patterns in data $x$, therefore directed links are used to model their relationships, as
in a Bayesian network.

The prior probability of each latent variable is represented as a log-linear model,

$$P(h_j = 1) = \sigma(d_j), \tag{1}$$

where $d_j$ is the parameter defining the prior distribution for node $h_j$, and
$\sigma(\cdot)$ is the sigmoid function $\sigma(z) = 1/(1 + e^{-z})$.

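The conditional probability given the latent variables is,

$$P(x_i = 1 \mid h) = \sigma\Big(\sum_j w_{ij} h_j + b_i\Big), \tag{2}$$

where $w_{ij}$ is the weight of the link connecting node $h_j$ and $x_i$; $b_i$ is the
offset for node $x_i$. The joint probability is,

$$P_\theta(x, h) = \frac{\exp\Big(\sum_{i,j} w_{ij} x_i h_j + \sum_i b_i x_i + \sum_j d_j h_j - \sum_i \log\big(1 + \exp\big(\sum_j w_{ij} h_j + b_i\big)\big)\Big)}{\prod_j \big(1 + \exp(d_j)\big)}. \tag{3}$$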


In this case, the model becomes a sigmoid belief network (SBN) [?] with one latent
layer. If more layers are added on top, the conditional probability is defined in the
same way as Eq. 2. The model is named a regression model based on the nature of the
conditional probability: for a discrete visible node, the input to the sigmoid function is
a linear combination of the latent variables; for a continuous visible node, the mean of
the node is a linear combination of its parent nodes.
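To make Eq. 3 concrete, the joint probability can be evaluated as a sum of local log-probabilities, one per node, with no separate normalizer. The following is a minimal Python sketch rather than the authors' implementation; the function names and the nested-list layout `W[i][j]` (row `i` for visible node $x_i$, column `j` for latent node $h_j$) are assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def log_joint(x, h, W, b, d):
    """log P(x, h) of a binary one-layer LRBN (Eq. 3).

    x: list of n_d bits, h: list of n_h bits,
    W: n_d x n_h nested list, b: n_d offsets, d: n_h prior parameters.
    """
    total = 0.0
    for j, hj in enumerate(h):
        # log P(h_j) = d_j h_j - log(1 + exp(d_j)), from Eq. 1
        total += d[j] * hj - math.log1p(math.exp(d[j]))
    for i, xi in enumerate(x):
        a = b[i] + sum(W[i][j] * h[j] for j in range(len(h)))
        # log P(x_i | h) = a x_i - log(1 + exp(a)), from Eq. 2
        total += a * xi - math.log1p(math.exp(a))
    return total
```

Summing `exp(log_joint(...))` over every configuration of `(x, h)` in a small model gives 1, which is a quick way to check that no partition function is needed.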

As a similar model with undirected links, RBMs (Fig. 1(b)) have been widely
used in the literature for feature learning and data representation. The joint probability
defined by a discrete RBM is,

$$P_{RBM}(x, h) = \frac{1}{Z} \exp\Big(\sum_{i,j} w_{ij} x_i h_j + \sum_i b_i x_i + \sum_j d_j h_j\Big). \tag{4}$$




Figure 1: Graph representation of the (a) directed and (b) undirected model for data
representation. Each link is associated with a weight parameter.

Comparing Eqs. 3 and 4, the additional term $-\sum_i \log(1 + \exp(\cdot))$ in the
numerator of Eq. 3 captures the correlations among the latent variables. This is the
reason why $P(h|x)$ is not factorized over individual latent nodes $h_j$ in LRBN, which
is the major difference from RBM. An advantage of Eq. 3 is that every term can be
computed given the values of all variables, without the issue of the intractable
partition function $Z$.

The discussion of hybrid LRBN is moved to a supplementary material due to the 
limited space. 

3.2 Hybrid LRBN 

For hybrid LRBN, the observation variables $x \in \mathbb{R}^{n_d}$ are continuous while
the latent variables $h \in \{0,1\}^{n_h}$ are binary. The prior distribution of the latent
variables is the same as Eq. 1. Given the latent variables, each visible variable is
assumed to follow a Gaussian distribution whose mean is a linear combination of the
latent variables,

$$P(x_i \mid h) \sim \mathcal{N}\Big(\sum_j w_{ij} h_j + b_i,\ \sigma_i^2\Big), \quad i = 1, \ldots, n_d, \tag{5}$$

where $w_{ij}$ is the weight of the link connecting node $h_j$ and $x_i$, $b_i$ is the
offset of the mean for node $x_i$, and $\sigma_i$ is the standard deviation. To simplify
the learning process, each component of the data is normalized to have zero mean and
unit variance, therefore $\sigma_i$ is set to 1. From the prior distribution and the
conditional distribution, the joint distribution for $x$ and $h$ is,


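$$P_\theta(x, h) = \frac{\exp\big(-\frac{1}{2}\|x - Wh - b\|^2 + d^T h\big)}{(2\pi)^{n_d/2} \prod_j \big(1 + \exp(d_j)\big)}, \tag{6}$$

or,

$$P_\theta(x, h) = \frac{1}{(2\pi)^{n_d/2} \prod_j \big(1 + \exp(d_j)\big)} \exp\Big(-\Big[\tfrac{1}{2}(x-b)^T(x-b) - (x-b)^T W h + \tfrac{1}{2} h^T W^T W h - d^T h\Big]\Big). \tag{7}$$

For brevity, vector and matrix forms are used: $W = \{w_{ij}\}$, $b = \{b_i\}$, $d = \{d_j\}$,
and $\theta = \{W, b, d\}$ represents all the parameters.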
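The second form of the joint distribution follows from the first by expanding the squared norm. A short worked step, using only the symbols defined above, shows where the term quadratic in $h$ comes from:

```latex
-\tfrac{1}{2}\,\|x - Wh - b\|^2
  = -\tfrac{1}{2}\,(x-b)^T (x-b) \;+\; (x-b)^T W h \;-\; \tfrac{1}{2}\, h^T W^T W h .
```

Adding $d^T h$ to the right-hand side gives the exponent of the expanded form.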

For real-valued input data and binary latent variables, the Gaussian-Bernoulli RBM
defines the joint probability of the visible and latent layers as,

$$P_{RBM}(x, h) = \frac{1}{Z} \exp\Big(-\tfrac{1}{2}(x-b)^T(x-b) + x^T W h + d^T h\Big), \tag{8}$$

where $Z$ is the partition function that makes $P_{RBM}(x, h)$ a valid probability
distribution. The input data is assumed to be normalized to have unit variance.

Comparing LRBN (Eq. 7) and RBM (Eq. 8) with the same dimensionality of the
visible and latent layers, the two models have the same number of parameters. However,
the directed model has a quadratic term $h^T W^T W h = \sum_i \big(\sum_j w_{ij} h_j\big)^2$, which does
not exist in the joint distribution of the RBM. This term explicitly captures the
correlations among the latent variables $h$. It also explains why, given the visible layer,
the latent variables are dependent on each other.
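Expanding the quadratic term entry by entry shows where the coupling sits. This is a worked step under the definitions above, using $h_j^2 = h_j$ for binary $h_j$:

```latex
h^T W^T W h
  = \sum_{j} \Big(\sum_i w_{ij}^2\Big) h_j
  \;+\; \sum_{j \neq k} \Big(\sum_i w_{ij} w_{ik}\Big) h_j h_k ,
  \qquad h_j \in \{0,1\}.
```

The first sum is linear in $h$ and behaves like an extra bias. The second sum contains products $h_j h_k$ weighted by the inner product of columns $j$ and $k$ of $W$, so the posterior over $h$ factorizes only in the special case where those columns are mutually orthogonal.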


4 LRBN Inference 


In this section, we introduce an efficient inference method for LRBN based on the
conditional pseudo-likelihood. Given an LRBN model with known parameters, the goal of
inference is to compute the posterior probability of the latent variables given input data,
i.e., computing $P(h|x)$.

In this work, we are interested in the maximum a posteriori (MAP) inference, which
is to find the configuration of latent variables that maximizes the posterior probability
given observations,

$$h^* = \arg\max_h P(h \mid x). \tag{9}$$

The MAP inference is motivated by the observation that, from the data generating point
of view, the variables in one latent layer take values according to the conditional
probability given its upper layer. Therefore this configuration dominates all the others in
explaining each data sample. In addition, the goal of feature learning is to learn a
feature $h$ that best explains $x$. In this regard, we only care about the most probable
states of the latent variables given the observation.

Because of the dependencies among elements of $h$, directly computing $P(h|x)$ is
computationally intractable, in particular when the dimension of $h$ is high. According
to the chain rule, the posterior probability of $h$ is,

$$P(h \mid x) = \prod_j P(h_j \mid h_1, \ldots, h_{j-1}, x). \tag{10}$$

The pseudo-likelihood replaces the conditional likelihood by a more tractable objective,
i.e.,

$$P(h \mid x) \approx \prod_j P(h_j \mid h_{-j}, x), \tag{11}$$

where $h_{-j} = \{h_1, \ldots, h_{j-1}, h_{j+1}, \ldots, h_{n_h}\}$ is the set of all latent
variables except $h_j$. In this approximation, we add conditioning over additional variables.

The conditional pseudo-likelihood can be factorized into local conditional
probabilities, which can be estimated in parallel. To optimize over the pseudo-likelihood,
one latent variable is updated by fixing all other variables,

$$h_j^{t+1} = \arg\max_{h_j} P(h_j \mid x, h_{-j}^t), \quad 1 \le j \le n_h, \tag{12}$$

where $t$ denotes the $t$-th iteration.


Theorem 1 The updating rule (Eq. 12) guarantees that the posterior probability $P(h|x)$
will only increase or stay the same after each iteration,

$$P(h_j^{t+1}, h_{-j}^t \mid x) \ge P(h^t \mid x). \tag{13}$$
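The inequality in Theorem 1 can be checked in one line by factoring the posterior around the variable being updated. The following is a short sketch of the argument, using only the product rule and the definition of the update in Eq. 12:

```latex
P(h_j^{t+1}, h_{-j}^t \mid x)
  = P(h_j^{t+1} \mid h_{-j}^t, x)\, P(h_{-j}^t \mid x)
  \;\ge\; P(h_j^{t} \mid h_{-j}^t, x)\, P(h_{-j}^t \mid x)
  = P(h^t \mid x).
```

The inequality holds because $h_j^{t+1}$ maximizes $P(h_j \mid h_{-j}^t, x)$ over the two values of $h_j$, while the factor $P(h_{-j}^t \mid x)$ is unchanged by the update.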
The conditional probability of one latent variable with all other variables fixed is
easy to compute, since it requires evaluating the joint distribution only twice, once for
each value of $h_j$,

$$P(h_j \mid x, h_{-j}) = \frac{P(h_j, x, h_{-j})}{\sum_{h_j'} P(h_j', x, h_{-j})}. \tag{14}$$

In general, computing the joint probability $P(x, h)$ has complexity $O(n_d n_h)$. If each
latent variable is updated $t$ times to get $h^*$, the overall complexity of LRBN
inference is $O(t\, n_d\, n_h^2)$, which is much lower than the $O(2^{n_h} n_d n_h)$ complexity of
computing $P(h|x)$ directly.
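To give a sense of scale, the two operation counts can be compared for a layer of 200 binary latent variables over a 28 x 28 binary image. The value $t = 10$ is an assumed number of sweeps chosen only for illustration:

```latex
t\, n_d\, n_h^2 = 10 \times 784 \times 200^2 \approx 3.1 \times 10^{8},
\qquad
2^{n_h}\, n_d\, n_h = 2^{200} \times 784 \times 200 \approx 2.5 \times 10^{65}.
```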

The updating method can be seen as a coordinate ascent algorithm, or as the iterated
conditional modes (ICM) used for inference in Markov random fields (MRFs). Typically in
an MRF the number of neighbors of one node is limited. In the case of LRBN, one latent
variable is related to all the other variables, due to the rich dependencies encoded in the
structure. As discussed above, existing methods that address the inference intractability
problem make the posterior probability completely factorized, thereby sacrificing
the dependencies among the latent variables. In contrast, through pseudo-likelihood,
we can preserve the dependencies to a certain degree.
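The coordinate-wise update of Eq. 12 can be sketched in a few lines for the binary one-layer model. This is an illustrative sketch, not the authors' code: the names `log_joint` and `icm_map`, the sweep limit `max_sweeps`, and the tie-breaking rule (prefer 0) are assumptions. Because the denominator in Eq. 14 does not depend on the candidate value of $h_j$, comparing the two joint probabilities is enough.

```python
import math


def log_joint(x, h, W, b, d):
    """log P(x, h) of a binary one-layer LRBN (Eq. 3)."""
    lp = sum(dj * hj - math.log1p(math.exp(dj)) for dj, hj in zip(d, h))
    for i, xi in enumerate(x):
        a = b[i] + sum(W[i][j] * hj for j, hj in enumerate(h))
        lp += a * xi - math.log1p(math.exp(a))
    return lp


def icm_map(x, h0, W, b, d, max_sweeps=20):
    """Approximate MAP inference by coordinate ascent over the latent bits.

    Starting from h0, each bit is set to the value with the larger joint
    probability while all other bits are held fixed. The loop stops when a
    full sweep changes nothing, or after max_sweeps sweeps.
    """
    h = list(h0)
    for _ in range(max_sweeps):
        changed = False
        for j in range(len(h)):
            old = h[j]
            h[j] = 0
            score0 = log_joint(x, h, W, b, d)
            h[j] = 1
            score1 = log_joint(x, h, W, b, d)
            h[j] = 1 if score1 > score0 else 0
            changed = changed or h[j] != old
        if not changed:
            break
    return h
```

Each accepted change cannot lower `log_joint`, so the returned configuration is at least as probable as the starting point, and when the loop stops early no single-bit flip improves it. It is a local optimum and may differ from the global MAP solution.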

The inference method requires an initialization for the hidden variables. Different
initializations will end up at different local optima. To obtain a consistent
initialization, we drop the direction of the links and treat the directed model as an
undirected one. The latent variables are then independent of each other given the
observations. Specifically, for binary input,

$$P(h_j = 1 \mid x) = \sigma\Big(\sum_i w_{ij} x_i + d_j\Big). \tag{15}$$

For continuous input, based on Eq. 7, we drop the off-diagonal terms of the matrix $W^T W$
for the sake of efficiency, resulting in a factorized distribution of the latent variables,

$$P(h_j = 1 \mid x) = \sigma\Big(\sum_i w_{ij} x_i + d_j - s_j\Big), \tag{16}$$

where $(s_1, \ldots, s_{n_h}) = \mathrm{diag}(\tfrac{1}{2} W^T W)$. For the initialization, the
dependency among the latent variables is ignored; it is then recovered through coordinate
ascent, by updating a subset of variables with the others fixed.
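The feed-forward initialization of Eq. 15 and Eq. 16 amounts to one pass through the weights. The sketch below is illustrative only: the function name `init_latent`, the `subtract_diag` flag, and the choice to threshold the probability at 0.5 (rather than sample from it) are assumptions, since the text does not say how the probability is turned into a binary state.

```python
import math


def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def init_latent(x, W, d, subtract_diag=False):
    """Initial latent states from the factorized approximation.

    With subtract_diag=False this follows Eq. 15 (binary input). With
    subtract_diag=True it follows Eq. 16 (continuous input), where
    s_j = 0.5 * sum_i W[i][j]**2 is the j-th diagonal entry of 0.5 * W^T W.
    """
    h0 = []
    for j in range(len(d)):
        a = sum(W[i][j] * x[i] for i in range(len(x))) + d[j]
        if subtract_diag:
            a -= 0.5 * sum(W[i][j] ** 2 for i in range(len(x)))
        # Assumed rule: take the more probable state of h_j.
        h0.append(1 if sigmoid(a) > 0.5 else 0)
    return h0
```

The result is intended as the starting point for the coordinate ascent of Eq. 12, which then reintroduces the dependencies that this factorized step ignores.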


5 LRBN Learning 

In this section, we introduce an efficient LRBN learning method based on the hard
Expectation Maximization (EM) algorithm. The conventional EM algorithm is not an
option here due to the intractability of computing the posterior probability in the E-step.
The hard version of the EM algorithm has been explored in [22] for learning a deep
Gaussian mixture model. That model has a deep structure in terms of linear
transformations, but only has two layers of variables.

5.1 Learning One Latent Layer


Consider the model $P_\theta(x, h)$ defined in Section 3. The goal of parameter learning is
to estimate the parameters $\theta = \{W, b, d\}$ given a set of data samples $D = \{x^{(m)}\}$.
The conventional maximum likelihood (ML) parameter estimation is to maximize the
following objective function,

$$\theta^* = \arg\max_\theta \sum_m \log \sum_h P_\theta(x^{(m)}, h). \tag{17}$$

The inner summation in Eq. 17 is intractable due to the exponentially many
configurations of $h$. In this work, we employ a max-out estimation of the data
log-probability, with the following objective function,

$$\theta^* = \arg\max_\theta \sum_m \log \max_h P_\theta(x^{(m)}, h). \tag{18}$$


Note that the max-out approximation of the data likelihood is not equivalent to 
approximating P(h\x) with a delta function. A delta function must be avoided since it 
is also factorized, which defeats our motivation of preserving the dependency. 

With the objective function in Eq. 18, the learning method becomes a hard version of the
EM algorithm, which iteratively fills in the latent variables and updates the parameters.
In the E-step, $h^* = \arg\max_h P(x, h)$ is efficiently estimated using the proposed
inference method. In the M-step, the problem of parameter estimation is straightforward
because we are now dealing with complete data.

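For binary observations, the parameter learning can be decomposed into learning the
local conditional probability distribution (CPD) of each variable, by solving multiple
logistic regression problems. The gradients with respect to the parameters are,

$$\frac{\partial \log P(x, h)}{\partial w_{ij}} = h_j \big(x_i - P(x_i = 1 \mid h)\big), \tag{19}$$

$$\frac{\partial \log P(x, h)}{\partial b_i} = x_i - P(x_i = 1 \mid h). \tag{20}$$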
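The gradients in Eq. 19 and Eq. 20 are those of a logistic regression from $h$ to each $x_i$, and they can be written out directly. The sketch below is for illustration; the names `log_cond` and `gradients` and the nested-list layout are assumptions.

```python
import math


def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def log_cond(x, h, W, b):
    """log P(x | h) for binary x under Eq. 2 (the part of log P(x, h)
    that depends on W and b)."""
    lp = 0.0
    for i, xi in enumerate(x):
        a = b[i] + sum(W[i][j] * hj for j, hj in enumerate(h))
        lp += a * xi - math.log1p(math.exp(a))
    return lp


def gradients(x, h, W, b):
    """Gradients of log P(x, h) with respect to W (Eq. 19) and b (Eq. 20)."""
    grad_W = [[0.0] * len(h) for _ in x]
    grad_b = [0.0] * len(x)
    for i, xi in enumerate(x):
        p = sigmoid(b[i] + sum(W[i][j] * hj for j, hj in enumerate(h)))
        grad_b[i] = xi - p
        for j, hj in enumerate(h):
            grad_W[i][j] = hj * (xi - p)
    return grad_W, grad_b
```

A finite-difference comparison against `log_cond` is a simple way to confirm the signs and indices. Note that a latent unit with $h_j = 0$ contributes no gradient to its column of $W$.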


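In hybrid LRBN, the complete-data log-likelihood is concave in the parameters, and the
maximum likelihood solution can be obtained by setting

$$\frac{\partial \sum_m \log P(x^{(m)}, h^{(m)})}{\partial \theta} = 0. \tag{21}$$

The solution of the parameters has a closed form. For the weights and offsets it is the
least-squares solution

$$[W,\ b] = \Big(\sum_m x^{(m)} \tilde{h}^{(m)T}\Big) \Big(\sum_m \tilde{h}^{(m)} \tilde{h}^{(m)T}\Big)^{-1}, \tag{22}$$

where $\tilde{h}^{(m)} = [h^{(m)}; 1]$ denotes the latent vector with a constant 1 appended.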
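For the unit-variance Gaussian model of Eq. 5, setting the gradient to zero with the latent states filled in leads to the normal equations of ordinary least squares. The sketch below solves them in plain Python. It is an illustration under that assumption, not the authors' code; the helper names `solve` and `fit_hybrid` and the augmented vector `[h; 1]` are introduced here for convenience.

```python
def solve(A, B):
    """Solve A Z = B for Z by Gauss-Jordan elimination with partial pivoting.

    A is an n x n nested list, B is an n x k nested list.
    """
    n = len(A)
    M = [list(A[r]) + list(B[r]) for r in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]


def fit_hybrid(X, H):
    """Least-squares W and b for x ~ N(W h + b, I) given complete data.

    X: list of real-valued observations, H: list of binary latent vectors.
    Requires sum_m h~ h~^T to be invertible.
    """
    n_d, n_h = len(X[0]), len(H[0])
    Ht = [list(h) + [1.0] for h in H]          # augmented vectors [h; 1]
    k = n_h + 1
    # A = sum_m h~ h~^T  (k x k),  B = sum_m h~ x^T  (k x n_d)
    A = [[sum(ht[r] * ht[c] for ht in Ht) for c in range(k)] for r in range(k)]
    B = [[sum(ht[r] * x[i] for ht, x in zip(Ht, X)) for i in range(n_d)]
         for r in range(k)]
    Z = solve(A, B)                            # Z = [W, b]^T
    W = [[Z[j][i] for j in range(n_h)] for i in range(n_d)]
    b = [Z[n_h][i] for i in range(n_d)]
    return W, b
```

If the inferred latent states do not span enough distinct configurations, the matrix `A` is singular and a regularized or gradient-based update would be needed instead.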


For large datasets, computing the gradient over all the training instances is time
consuming. A stochastic gradient ascent algorithm can be used to address this issue:
the true gradient is approximated by the gradient on a minibatch of training samples.
As the algorithm sweeps through the entire training set, it performs one gradient
update per minibatch. Several passes are made over the training set until the
algorithm converges. The learning method is summarized in Algorithm 1.

Algorithm 1 Unsupervised parameter learning of an LRBN with one latent layer.

Input: training data $X = \{x^{(m)}\}$
Output: parameters $\theta$ of an LRBN
1: Initialize the parameters $\theta$;
2: Initialize the states $h_0$ of all the latent variables using a feed-forward model
   (Section 4);
3: while the parameters have not converged do
4:    Randomly select a minibatch of data instances $x \in X$;
5:    Update the corresponding $h$ for $x$ by maximizing the posterior probability, using
      the current parameters,
         $$h^* = \arg\max_h P_\theta(x, h); \tag{23}$$
6:    Compute the gradients using Eq. 19 and 20, and update the parameters,
         $$\theta = \theta + \lambda \nabla_\theta \log P(x, h^*); \tag{24}$$
7: end while
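The steps of Algorithm 1 can be put together in a short training loop for the binary case. The following is a sketch under several stated assumptions, not the authors' implementation: it uses a minibatch of size 1, a fixed number of epochs instead of a convergence test, small Gaussian weight initialization, and it re-initializes $h$ with Eq. 15 before each E-step. The update for the prior parameters uses $\partial \log P / \partial d_j = h_j - \sigma(d_j)$, which follows from Eq. 1 but is not written out in the text. All function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def activation(i, h, W, b):
    """Input to the sigmoid for visible node i (Eq. 2)."""
    return b[i] + sum(W[i][j] * hj for j, hj in enumerate(h))


def log_joint(x, h, W, b, d):
    """log P(x, h) as in Eq. 3."""
    lp = sum(dj * hj - math.log1p(math.exp(dj)) for dj, hj in zip(d, h))
    for i, xi in enumerate(x):
        a = activation(i, h, W, b)
        lp += a * xi - math.log1p(math.exp(a))
    return lp


def infer(x, W, b, d, max_sweeps=20):
    """E-step: feed-forward start (Eq. 15), then coordinate ascent (Eq. 12)."""
    n_h = len(d)
    h = [1 if sum(W[i][j] * x[i] for i in range(len(x))) + d[j] > 0 else 0
         for j in range(n_h)]
    for _ in range(max_sweeps):
        changed = False
        for j in range(n_h):
            old = h[j]
            h[j] = 0
            s0 = log_joint(x, h, W, b, d)
            h[j] = 1
            s1 = log_joint(x, h, W, b, d)
            h[j] = 1 if s1 > s0 else 0
            changed = changed or h[j] != old
        if not changed:
            break
    return h


def train_hard_em(X, n_h, epochs=50, lr=0.25, seed=0):
    """Hard EM with stochastic gradient ascent, one sample per update."""
    rng = random.Random(seed)
    n_d = len(X[0])
    W = [[rng.gauss(0.0, 0.01) for _ in range(n_h)] for _ in range(n_d)]
    b = [0.0] * n_d
    d = [0.0] * n_h
    for _ in range(epochs):
        for m in rng.sample(range(len(X)), len(X)):   # one shuffled pass
            x = X[m]
            h = infer(x, W, b, d)                     # h* = argmax_h P(x, h)
            for i, xi in enumerate(x):                # Eq. 19 and Eq. 20
                err = xi - sigmoid(activation(i, h, W, b))
                b[i] += lr * err
                for j, hj in enumerate(h):
                    W[i][j] += lr * hj * err
            for j, hj in enumerate(h):                # prior parameters d
                d[j] += lr * (hj - sigmoid(d[j]))
    return W, b, d
```

A call such as `train_hard_em(X, n_h=2)` on a list of binary vectors `X` returns the learned `W`, `b` and `d`; `infer` can then be reused to extract features for new inputs.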


5.2 Learning Deep Layers 

The learning method of a two-layer LRBN not only provides the parameters $\theta$, but also
performs the MAP inference to obtain $h^* = \arg\max_h P(h|x)$. If we denote the features
as $h^1$ and treat them as the input to another LRBN, the same learning procedure can be
repeated to learn another layer of features $h^2$. By doing this we stack another LRBN
on top of the first one to build a deep model.

In general, let $h^l$ denote the variables in the $l$-th latent layer ($0 \le l \le L$, $h^0 = x$),
and $\theta^l$ be the parameters involved between layers $l$ and $l+1$.

The parameter $\theta^{l*}$ is estimated as,

$$\theta^{l*} = \arg\max_{\theta^l} \sum_m \log \max_{h^{l+1}} P_{\theta^l}(h^{l,(m)}, h^{l+1}), \quad 1 \le l < L. \tag{25}$$

To optimize the objective function, we use the stochastic gradient ascent method, and
replace the input $X$ by $h^l$ in Algorithm 1. By performing the layer-wise learning
procedure from the first latent layer to the $L$-th, we learn a deep model sequentially
from the bottom up. Each time, the MAP estimate of one latent layer is treated as input to
the next two-layer model. For data instance $x^{(m)}$,

$$h^{l,(m)} = \arg\max_{h^l} P(h^l \mid h^{l-1,(m)}), \quad 1 \le l \le L, \tag{26}$$

where $h^{0,(m)} = x^{(m)}$. The layer-wise pre-training procedure extracts different levels
of features from the input data, and also provides an initial estimate of the parameters.

5.3 Fine Tuning 

The layer-wise training ignores the parameters of other layers when training a model
for each layer from the bottom up. To improve the model performance globally, we
employ a top-down fine tuning procedure after the layer-wise pre-training phase.
Depending on whether labels are available or not, the fine tuning can be done in
either a supervised or an unsupervised manner, as discussed below.


5.3.1 Unsupervised Fine Tuning 


By extending Eq. 18, the objective function for learning with multiple latent layers is,

$$\theta^* = \arg\max_\theta \sum_m \log \max_{\{h^l\}} P_\theta(x^{(m)}, h^1, \ldots, h^L). \tag{27}$$


Given the states in layer $l$, the variables in layer $l-1$ are independent of the variables
in layer $l+1$. Therefore, the unsupervised fine-tuning is performed for every three
consecutive layers. The variables in the middle layer are updated with the upper and
lower layers fixed,

$$h^{l*} = \arg\max_{h^l} P(h^l \mid h^{l-1}, h^{l+1}), \quad 1 \le l \le L-1. \tag{28}$$

The conditional probability $P(h^l \mid h^{l-1}, h^{l+1})$ is also approximated by the conditional
pseudo-likelihood,

$$P(h^l \mid h^{l-1}, h^{l+1}) \approx \prod_j P(h_j^l \mid h_{-j}^l, h^{l-1}, h^{l+1}). \tag{29}$$

MAP inference is performed by updating one variable with all others fixed,

$$h_j^{l,t+1} = \arg\max_{h_j^l} P(h_j^l \mid h_{-j}^{l,t}, h^{l-1}, h^{l+1}), \quad 1 \le j \le n_h. \tag{30}$$

To be consistent with the bottom-up training, the initialization of the latent variables
$h^l$ always follows Eq. 15 and 16.

With the updated layer $h^l$, we are able to update the parameters $\theta^{l-1}$ and $\theta^l$ through,

$$\theta^{l-1*} = \arg\max_{\theta^{l-1}} \sum_m \log P(h^{l-1,(m)}, h^{l,(m)}), \tag{31}$$

and

$$\theta^{l*} = \arg\max_{\theta^l} \sum_m \log P(h^{l,(m)}, h^{l+1,(m)}). \tag{32}$$

This process alternates between updating the parameters and updating the latent states.
Information is therefore able to propagate among different layers, and the overall
quality of the model increases. The fine tuning proceeds in a top-down manner, starting
from layers $\{L-2, L-1, L\}$ and ending with layers $\{0, 1, 2\}$. The bottom-up
pre-training and top-down fine-tuning procedures are performed iteratively until the
parameters converge.

5.3.2 Supervised Fine Tuning


The parameters of the model can be fine-tuned discriminatively if label information
is available. Define a set of target variables $t = (t_1, \ldots, t_C)$, with $C$ being the
total number of classes: $t_c = 1$ if a sample belongs to class $c$, and $t_k = 0, \forall k \ne c$.
The supervised fine-tuning contains three steps. First, a layer-wise pre-training is
performed with $L-1$ latent layers (excluding the top layer) using the method discussed
in Section 5.2, obtaining $\theta^l$ and $h^l$, $1 \le l \le L-1$.

Second, the parameter $\theta^L$ for the top two layers is estimated as,

$$\theta^{L*} = \arg\max_{\theta^L} \sum_m \log P(h^{L-1,(m)}, t^{(m)}). \tag{33}$$

This step works with complete data and does not require inference, because $t$ is
always observed. These two steps form the bottom-up pre-training procedure.

Third, the top-down fine tuning starts with layers $\{L-2, L-1, L\}$ with $h^L = t$
and ends with layers $\{0, 1, 2\}$. Latent state updating follows Eq. 28, and parameter
updating follows Eq. 31 and 32.


6 Experiments 

In this section, we evaluate the performance of LRBN and compare it against other
methods on three binary datasets: MNIST, Caltech 101 Silhouettes and OCR letters. Binary
datasets are chosen to allow comparison with other models. The extension to real-valued
datasets is straightforward. The experiments evaluate the representation and
reconstruction power of the proposed model.

6.1 Experimental protocol 

We trained the LRBN model using the stochastic gradient ascent algorithm with learning
rate 0.25. The size of the minibatches is set to 20. Two different structures are studied:
one hidden layer with 200 variables, and two hidden layers with each layer containing
200 variables, consistent with the configurations in [?]. For each dataset, we
randomly selected 100 samples from the training set to form a validation set. The joint
probability on the validation set is used as the criterion for early stopping.

In this section, we first evaluate the MAP configuration of the latent variables
through reconstruction. Reconstruction is performed as follows: given a data vector $x$,
perform a MAP inference to get $h^* = \arg\max_h P(h|x)$. Then $\hat{x} = \arg\max_x P(x|h^*)$
is the reconstructed data. The reconstruction error $\|x - \hat{x}\|^2$ evaluates how well the
model fits the data.
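For binary data the reconstruction step has a simple form, because $P(x|h)$ factorizes over the visible nodes in the directed model: each $\hat{x}_i$ is 1 exactly when $P(x_i = 1 | h^*) > 0.5$. The sketch below illustrates this; the function names are assumptions, and the error is counted as the number of mismatched pixels, which equals the squared Euclidean distance for binary vectors.

```python
def reconstruct(h, W, b):
    """x_hat = argmax_x P(x | h) for binary x under Eq. 2.

    P(x_i = 1 | h) > 0.5 is equivalent to a positive sigmoid input.
    """
    return [1 if b[i] + sum(W[i][j] * hj for j, hj in enumerate(h)) > 0 else 0
            for i in range(len(b))]


def reconstruction_error(x, x_hat):
    """Squared Euclidean distance between x and x_hat."""
    return sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))
```

Averaging `reconstruction_error` over a test set, with `h` obtained by MAP inference for each input, gives a per-image error measured in pixels.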

The second criterion is the widely used test data log-probability. Directly computing
the probability $P(x)$ is intractable due to the exponentially many terms in the
summation $P(x) = \sum_h P(x, h)$. In this work, we estimate the log-probability using the
conservative sampling-based log-likelihood (CSL) method [1],

$$\log \hat{P}(x) = \log \operatorname{mean}_{h \in S} P(x \mid h), \tag{34}$$

where $S$ is a set of samples $h$ of the latent variables collected from a Markov chain. The
expectation of the estimator is a lower bound on the true log-likelihood [1]. Because
of the nature of directed models, samples can be collected with the ancestral sampling
procedure. Specifically, the top layer is sampled from the prior probability $P(h^L)$,
and the lower layers are sampled from the conditional probabilities $P(h^{l-1}|h^l)$. One
million samples are used to reach convergence of the estimate of the log-probability, and
the average of ten repetitions is reported.
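The estimator in Eq. 34 can be sketched for a model with a single latent layer, where ancestral sampling reduces to drawing $h$ from the prior of Eq. 1. This is an illustrative sketch, not the evaluation code used for the reported numbers; the function names are assumptions, and the log-mean is computed with the usual max-shift to avoid underflow.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sample_prior(d, rng):
    """Draw h from the factorized prior P(h_j = 1) = sigma(d_j) (Eq. 1)."""
    return [1 if rng.random() < sigmoid(dj) else 0 for dj in d]


def log_cond(x, h, W, b):
    """log P(x | h) for binary x under Eq. 2."""
    lp = 0.0
    for i, xi in enumerate(x):
        a = b[i] + sum(W[i][j] * hj for j, hj in enumerate(h))
        lp += a * xi - math.log1p(math.exp(a))
    return lp


def csl_log_prob(x, W, b, d, n_samples, rng):
    """log of the sample mean of P(x | h) over h drawn from the prior (Eq. 34)."""
    logs = [log_cond(x, sample_prior(d, rng), W, b) for _ in range(n_samples)]
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs) / n_samples)
```

For a model small enough to enumerate every $h$, the estimate can be compared against the exact $\log \sum_h P(x, h)$, which is a useful sanity check before running it at scale.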

The reconstruction error evaluates the quality of the most probable explanation 
of the latent variables given observations, while the data log-probability evaluates the 
overall quality of all configurations of latent variables. They are two complementary 
criteria for model evaluation. 

In the experiments, we compare with published results if they are available. For
reconstruction, we implemented the NVIL [13], RBM [4], DBN [19] and DBM [17] models
following the respective references (denoted by (*) in the following tables). The similar
log-likelihood achieved by our implementations indicates that they are correct.

6.2 MNIST dataset 

The first experiment is performed on the binary version of the MNIST dataset
(thresholding at 0.5). The dataset consists of 70,000 handwritten digits with dimension
28x28. It is partitioned into a training set of 60,000 images and a testing set of
10,000 images.

The average reconstruction errors of different learning models are reported in
Table 1. The MAP inference of neural variational inference and learning (NVIL) [13] is
performed through the inference network. For the deep belief network (DBN) [19] and the
deep Boltzmann machine (DBM) [17], the posterior $P(h|x)$ is already factorized, so the
inference is performed individually for each latent variable. The average reconstruction
error of the proposed model is 4.56 pixels, which significantly outperforms the other
competing methods. This is consistent with our objective function, indicating that the
most probable explanation contains most of the information in the input data, which is
effectively captured by the proposed model. Some examples of the reconstruction are
shown in Fig. 2.


Table 1: Average reconstruction errors of different methods on MNIST dataset. (*)
represents our own implementation, same for the following tables.

Method        DIM          Recon Error
NVIL* [13]    200 - 200    35.52
DBN* [19]     200 - 200    29.78
DBM* [17]     200 - 200    23.52
LRBN          200 - 200    4.56

In Table 2, we report the average log-probability on the test set. With the same
dimensionality, LRBN outperforms variational Bayes [5], and is similar to the model
learned using NVIL [13]. Even though our objective function does not explicitly maximize
the data likelihood, the learned model achieves performance comparable to
state-of-the-art learning methods, which indicates that the proposed method is also
effective in capturing the distribution of the training data. In all algorithms, introducing
a second hidden layer improves the performance. Our method achieves an improvement of
about 8 nats when an additional latent layer is used. Some samples from the generative
model are given in Fig. 3.

Table 2: Test data log-probabilities of different models on MNIST dataset.

Method       DIM           Log-prob (10k)
VB [5]       200           -116.91
VB [5]       200 - 200     -110.74
NVIL [13]    200           -113.1
NVIL [13]    200 - 200     -99.8
DBN [19]     500 - 2000    -86.22
DBM [17]     500 - 1000    -84.62
LRBN         200           -108.7
LRBN         200 - 200     -100.3

Figure 2: Examples of the reconstruction, (a) Original digit images, (b) reconstructed
by the proposed model.

There is still a gap between the proposed learning method and DBN or DBM. One
reason is that we do not have as many latent nodes as in DBN and DBM. Moreover,
the max-out approximation of the data likelihood during learning drops all the
non-dominant configurations of the latent variables. Therefore it does not necessarily
perform well on the task of likelihood comparison. However, it still achieves comparable
or even better performance compared to other learning methods.

6.3 Caltech 101 Silhouettes dataset 

The second experiment is performed on the Caltech 101 Silhouettes dataset. The
dataset contains 6364 training images and 2307 testing images. The reconstruction
error is reported in Table 3. The proposed learning method outperforms all the competing
methods by a large margin, indicating the effectiveness of the max-out approximation.

Figure 3: Random samples from the generative model on MNIST dataset.


The test data log-probability is reported in Table 4. With the same dimensionality,
the model learned by the proposed algorithm outperforms the one learned by variational
Bayes [5], which is considered one of the state-of-the-art methods for training sigmoid
belief networks. With one hidden layer of size 200, the improvement is 34 nats; with a
second hidden layer of size 200, the improvement is 28 nats. Moreover, compared to an
RBM with many more parameters, our model also achieves better performance,
indicating the importance of the underlying dependency of the latent variables. Samples
from the generative model are shown in Fig. 4.


Table 3: Average reconstruction errors of different methods on Caltech 101 Silhouettes
dataset.

Method        DIM          Recon Error
NVIL* [13]    200 - 200    29.78
RBM* [4]      200          32.47
DBN* [19]     200 - 200    28.17
DBM* [17]     200 - 200    24.90
LRBN          200 - 200    5.95


Table 4: Test data log-probability on Caltech 101 Silhouettes dataset.

Method       DIM          Log-prob
VB [5]       200          -136.84
VB [5]       200 - 200    -125.60
RBM [4]      4000         -107.78
DBN* [19]    200 - 200    -120.46
DBM* [17]    200 - 200    -118.73
LRBN         200          -102.21
LRBN         200 - 200    -97.49

















Figure 4: Random samples from the generative model on Caltech 101 Silhouettes 
dataset. 

6.4 OCR letters dataset 

The last experiment is performed on the OCR letters dataset, which contains 42,152
training images and 10,000 testing images of English letters. The images have a
dimensionality of 16 x 8.

The reconstruction error is reported in Table 5. The proposed method shows superior
performance compared to all the competing methods. The average reconstruction
error on the test set is 2.03 pixels, which is at least 9 pixels better than the other
methods.

The test data log-probability is reported in Table 6. Our model obtains a variational
lower bound of -35.02, which outperforms the variational Bayes learning method, and
is slightly worse than DBM [18], which has 100 times more parameters. Samples from
the LRBN are shown in Fig. 5. We display the samples of letter 'g'. For the same letter,
the learned model is able to capture the different handwriting styles while preserving
the key information.




Table 5: Average reconstruction errors of different methods on OCR letters dataset.

Method        DIM          Recon Error
NVIL* [13]    200 - 200    14.79
RBM* [4]      200          16.83
DBN* [19]     200 - 200    12.47
DBM* [17]     200 - 200    11.14
LRBN          200 - 200    2.03

Table 6: Test data log-probability on OCR letters dataset.

Method       DIM            Log-prob
VB [5]       200            -48.20
VB [5]       200 - 200      -47.84
DBN* [19]    200 - 200      -40.75
DBM [18]     2000 - 2000    -34.24
LRBN         200            -39.48
LRBN         200 - 200      -35.02

Figure 5: Random samples from the generative model on OCR letters dataset.


7 Conclusion 

In this work, we introduce a directed deep model based on the latent regression Bayesian
network to explicitly capture the dependencies among the latent variables for data
representation. We introduce an efficient inference method based on pseudo-likelihood
and coordinate ascent. A hard EM learning method is proposed for efficient parameter
learning. The proposed inference method addresses the inference intractability while
preserving the dependencies among latent variables. We theoretically and empirically
compare different models and learning methods. We point out that the latent variables
in the regression Bayesian network have strong dependencies, which can better explain
the patterns in the input layer. Experiments on benchmark datasets show that the
proposed model significantly outperforms the existing models in data reconstruction and
achieves comparable performance for data representation.